Decision Trees: Next Level

Measuring Success

Gini Index and Cost Complexity

Recall: Our simplest cannabis tree

  • Which of the final nodes (or leaves) is most pure?

  • Which is least pure?

  • Could we split a node further for better purity?

    • Almost certainly, yes! It’s highly unlikely that all of the unused variables have exactly the same prevalence across categories.

    • Should we do it, or is that overfitting?

Classification Error

  • What is the classification error of each leaf?

  • (Left to right)

    • 0.35
    • 0.37
    • 0.46
    • 0.36
rpart.plot(tree_fitted$fit, 
           roundint = FALSE, 
           cex = 1.25)
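As a quick sanity check on these numbers (using hypothetical class proportions for one leaf, not values read off the plot above), the classification error of a leaf is one minus the share of its majority class:

```r
# Hypothetical class proportions in one leaf (not taken from the plot above)
leaf_props <- c(hybrid = 0.65, indica = 0.21, sativa = 0.14)

# Classification error = 1 - proportion of the majority class
class_error <- 1 - max(leaf_props)
class_error  # 0.35
```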

Gini Index

The Gini Index for a particular leaf (not the whole tree) sums, over the classes, each class proportion times that class’s error rate, \(\sum_k \hat{p}_k(1 - \hat{p}_k)\):


\((0.35*0.65) + (0.21*0.79) + (0.14*0.86) = 0.5138\)


  • small values if the classification errors are close to 0, i.e., high leaf purity
  • large values if the leaf is an even mix of classes (the maximum is \((K-1)/K\) for \(K\) classes)
  • this is related to the variance of the leaf
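Using the same numbers as the formula above, the per-leaf Gini computation is a one-liner:

```r
# Gini Index of a leaf: sum over classes of p * (1 - p)
gini <- function(p) sum(p * (1 - p))

gini(c(0.35, 0.21, 0.14))  # 0.5138
```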

Calculating the Gini Index

To calculate the Gini Index averaged across all leaves:

cann %>%
  bind_cols(
    predict(tree_fitted, new_data = cann, type = "prob")
  ) %>%
  gain_capture(truth = Type,
               .pred_hybrid, .pred_indica, .pred_sativa)
# A tibble: 1 × 3
  .metric      .estimator .estimate
  <chr>        <chr>          <dbl>
1 gain_capture macro          0.482
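To turn per-leaf Gini values into one tree-wide number, each leaf is weighted by how many observations it holds. A sketch with made-up leaf sizes and Gini values (not the fitted tree's actual leaves):

```r
# Hypothetical per-leaf summaries: leaf sizes and Gini values
leaves <- data.frame(
  n    = c(600, 900, 500, 305),
  gini = c(0.51, 0.48, 0.55, 0.46)
)

# Size-weighted average Gini across all leaves
weighted_gini <- sum(leaves$n * leaves$gini) / sum(leaves$n)
weighted_gini
```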

Cost Complexity Revisited

So, when should we split the tree further?

Only if the new splits improve the Gini Index by a certain amount.

This is the cost_complexity parameter!

But wait! This is a penalized metric, using an arbitrary penalty \(\alpha\) to avoid overfitting.

Don’t we like cross-validation better?

Well… yes.

But imagine fitting every possible tree and cross-validating… yikes.

We have to limit our options and cut our losses somehow!
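To see the penalty in action, here is a quick illustration with base rpart on the built-in iris data (cp is rpart's name for the cost-complexity penalty; the cp values below are arbitrary choices for contrast):

```r
library(rpart)

# A small penalty keeps more splits; a large penalty prunes the tree back
tree_loose  <- rpart(Species ~ ., data = iris, cp = 0.001)
tree_strict <- rpart(Species ~ ., data = iris, cp = 0.2)

# Each row of $frame is a node, so fewer rows = a smaller tree
nrow(tree_loose$frame)
nrow(tree_strict$frame)
```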

Bagging

Tree Variability

Suppose I took two random subsamples of my cannabis dataset:

set.seed(9374534)

splits <- cann %>% 
  initial_split(0.5, strata = Type)

cann_1 <- splits %>% training()
cann_2 <- splits %>% testing()



dim(cann_1)
[1] 1151   69
dim(cann_2)
[1] 1154   69

Tree Variability

Then I fit a decision tree to each:

tree_1 <- tree_wflow %>%
  fit(cann_1)

tree_2 <- tree_wflow %>%
  fit(cann_2)

How similar will the results be?

Tree 1

tree_1 %>% 
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Tree 2

tree_2 %>% 
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Tree Variability

So… which tree should we believe?

More Subsamples!

Let’s take several subsamples of the data, and make trees from each.

Then, to classify a new observation, we run it through all the trees and let them vote!

(It’s a bit like a KNN for trees!)

This is called bagging.
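The voting step itself is just a majority count. With hypothetical predictions from five trees for one new observation:

```r
# Hypothetical predictions from five bagged trees for a single observation
votes <- c("indica", "hybrid", "indica", "sativa", "indica")

# The bagged prediction is the most common vote
names(which.max(table(votes)))  # "indica"
```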

Bagging – Setup

library(baguette)

bag_tree_spec <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

Bagging – Fitting

bag_tree_wflow <- workflow() %>%
  add_recipe(cann_recipe) %>%
  add_model(bag_tree_spec)

bag_tree_fit <- bag_tree_wflow %>%
  fit(cann)

Caution

This step can take a while! Be patient!

Bagging

What variables were most important to the trees?

bag_tree_fit %>% 
  extract_fit_parsnip()
parsnip model object

Bagged CART (classification with 5 members)

Variable importance scores include:

# A tibble: 63 × 4
   term      value std.error  used
   <chr>     <dbl>     <dbl> <int>
 1 Rating    321.       6.74     5
 2 Sleepy    174.       4.07     5
 3 Focused    65.6      4.35     5
 4 Sweet      65.5      3.93     5
 5 Creative   64.8      6.08     5
 6 Relaxed    64.3      8.59     5
 7 Earthy     62.5      2.24     5
 8 Euphoric   62.2      6.83     5
 9 Uplifted   60.2      2.31     5
10 Energetic  59.9      4.53     5
# ℹ 53 more rows

Random Forests

Random Forests

What if some important variables are being masked by more important variables?

Remember, we have 65 predictors - yikes! So, let’s give some of the other predictors a chance to shine.

Randomly Choosing Predictors

Randomly choose a set of the predictors to include in the data:

cann_reduced <- cann %>%
  select(Type, 
         sample(5:65, size = 30)
         )

cann_reduced
# A tibble: 2,305 × 31
   Type   Orange Minty Pungent Strawberry Relaxed  Tree  Sage Pineapple Mouth
   <fct>   <dbl> <dbl>   <dbl>      <dbl>   <dbl> <dbl> <dbl>     <dbl> <dbl>
 1 hybrid      0     0       0          0       1     0     0         0     0
 2 hybrid      0     0       0          0       1     0     0         0     0
 3 sativa      0     0       0          0       1     0     1         0     0
 4 hybrid      0     0       0          0       1     0     0         0     0
 5 hybrid      1     0       0          0       1     0     0         0     0
 6 indica      0     0       0          0       0     0     0         0     0
 7 hybrid      0     0       1          0       1     0     0         0     0
 8 indica      0     0       1          0       1     0     0         0     0
 9 sativa      0     0       0          0       1     0     0         0     0
10 indica      0     0       0          0       1     0     0         0     0
# ℹ 2,295 more rows
# ℹ 21 more variables: Diesel <dbl>, Grape <dbl>, Coffee <dbl>, Lavender <dbl>,
#   Mango <dbl>, Sleepy <dbl>, Tingly <dbl>, Flowery <dbl>, Sweet <dbl>,
#   Creative <dbl>, Talkative <dbl>, Giggly <dbl>, Chestnut <dbl>, Skunk <dbl>,
#   Tropical <dbl>, Ammonia <dbl>, Nutty <dbl>, Lime <dbl>, Dry <dbl>,
#   Chemical <dbl>, Citrus <dbl>

A New Decision Tree

cann_recipe_2 <- recipe(Type ~ ., 
                     data = cann_reduced)

tree_fit_reduced <- workflow() %>%
  add_recipe(cann_recipe_2) %>%
  add_model(tree_mod) %>%
  fit(cann_reduced)

tree_fit_reduced %>% 
  extract_fit_engine() %>% 
  rpart.plot(roundint = FALSE)

Random Forests – Final Model

After making many random reduced trees, we then bag the results to end up with a random forest. (In a full random forest, a fresh random subset of predictors is drawn at every split, not just once per tree.)

The advantage of this is that a wider variety of variables gets involved in the process.

This way, we don’t accidentally overfit to a variable that happens to be extremely relevant to our particular dataset.
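In tidymodels, this whole procedure is available directly as rand_forest(). A sketch fit on the built-in iris data (the mtry and trees values are illustrative choices, and the ranger engine assumes the ranger package is installed):

```r
library(tidymodels)

set.seed(12345)

# mtry = number of predictors sampled at each split; trees = forest size
rf_spec <- rand_forest(mtry = 2, trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_fit <- rf_spec %>%
  fit(Species ~ ., data = iris)

# In-sample accuracy (for a real analysis, evaluate on held-out data)
preds <- predict(rf_fit, new_data = iris)
mean(preds$.pred_class == iris$Species)
```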

Your turn

Open the Activity-RF-Bagging.qmd activity file

  1. Find the best bagged model for the cannabis data
  2. Find the best random forest model for the cannabis data
  3. Report some metrics on your models

Reference for Fitting Trees

Don’t forget you can use the reference guide (in the R References Module on Canvas) for guidance on how to fit these models!